Predicting prices based on house features

For the approach, we are going to follow the machine learning workflow shown above.
This is a regression problem because it involves predicting a continuous value (price). So, let's look at the machine learning algorithm cheat sheets provided by SAS and Microsoft.
Based on the cheat sheets, we can try several algorithms:
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
plt.style.use('ggplot')
print("numpy version {}".format(np.__version__))
print("pandas version {}".format(pd.__version__))
print("sklearn version {}".format(sklearn.__version__))
data = pd.read_csv("./data/kc_house_data.csv")
data.head()
data.info()
data.describe()
df = data.copy()
df.info()
Insights:
df['date'] = pd.to_datetime(df['date'])
df.head()
def val_count(col):
    '''Display the unique value counts for a given column.'''
    display(df[col].value_counts())
val_count("bathrooms")
According to an article from www.realtor.com, a full bathroom is made up of four parts: a sink, a shower, a bathtub, and a toilet. Hence, the bathrooms column counts bathrooms, where a fractional value means a bathroom is missing one or more of the four parts.
Because the fractional values are there for a valid reason, we are not going to round them.
df.isnull().sum()
There are no null values in the dataframe.
# check for duplicate rows based on id
duplicates = df[df.duplicated(subset=['id'], keep=False)]
print("Number of rows with duplicated id {}".format(len(duplicates)))
display(duplicates.sort_values(by=["id"]).head())
# check for duplicate rows based on id and date
duplicates = df[df.duplicated(subset=['id', 'date'], keep=False)]
print("Number of rows with duplicated id and date {}".format(len(duplicates)))
duplicates.sort_values(by=["id"]).head(10)
Firstly, we checked for duplicated rows based on id and found 210 of them. Looking at these rows, we can see that the same house can be sold twice. Therefore, we checked for duplicates again based on both id and date, because it is not possible to sell a house twice on the same day.
This second check found no duplicated rows in the dataframe.
We can make a new feature called resold based on this information.
df.hist(figsize = (15,18))
plt.show()
Insights:
- bathrooms, condition, floors, and grade are distributed fairly evenly.
- bedrooms, price, sqft_living, sqft_lot, and sqft_lot15 are skewed.
- For view and yr_renovated, we can investigate further to see if it is better to represent these values as boolean values.

Let's examine bedrooms first.
val_count("bedrooms")
fig = px.box(df, y="bedrooms")
fig.show()
fig = px.scatter(data_frame=df, x=df["bedrooms"], y=df["sqft_living"], color="price", title="Scatter Plot of Living Area vs Bedrooms")
fig.show()
df[df["bedrooms"] > 7].sort_values(by=["bedrooms"])
Considering the relatively low living area (1,620 sqft), price ($640k), and bathroom count for the house with 33 bedrooms, that value is likely invalid. Therefore, we are going to drop it.
df = df.drop(index = df[df["bedrooms"] > 15].index)
Next, we are going to look at price
fig = px.box(df, y="price", title="price boxplot")
fig.show()
fig = px.histogram(df, title="price histogram", x="price")
fig.show()
fig = px.scatter(data_frame=df,
x=df["sqft_living"],
y=df["price"],
title="Scatter Plot of Living Area vs Price")
fig.show()
percentile99 = df["price"].quantile(q=0.99)
print("99 percentile: {}".format(percentile99))
above99 = df[df["price"] > percentile99].sort_values(by=["price"], ascending=False)
above99.head(10)
fig = px.scatter(above99, title="Prices vs Living Area of Houses Above 99 Percentile", x="sqft_living", y="price")
fig.show()
The houses with prices above the 99th percentile tend to have a bigger living area. Because these houses are rare and they add noise to our data, we can drop them as outliers.
df = df.drop(index = df[df["price"] > percentile99].index)
sqft_lot
fig = px.box(df, y="sqft_lot")
fig.show()
lot_percentile99 = df["sqft_lot"].quantile(q=0.99)
print("99 percentile: {}".format(lot_percentile99))
lot_above99 = df[df["sqft_lot"] > lot_percentile99].sort_values(by=["sqft_lot"], ascending=False)
lot_above99.head(5)
After examining the zipcodes on Google Maps satellite view, we found that the records with the highest sqft_lot seem to be located in remote areas with a lot of unused land. Hence, these properties are possibly used as farmland.
yr_renovated
fig = px.histogram(x=df[df["yr_renovated"] > 1900]["yr_renovated"])
fig.show()
Renovations done before 1990 would have had little remaining impact on the house by 2014-2015, when these sales occurred. Therefore, we are going to make a new boolean feature, renovated90, which is 0 for renovations done in or before 1990 and 1 for renovations done after 1990.
fig = px.scatter_mapbox(df,
lat="lat",
lon="long",
hover_name="id",
hover_data=["price", "yr_built", "yr_renovated"],
color="price",
color_continuous_scale=px.colors.cyclical.IceFire,
zoom=8,
height=400)
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
fig = px.box(df, x="waterfront", y="price", title="Waterfront vs house prices")
fig.show()
Insights:
df['living_lot_ratio'] = df['sqft_living']/df['sqft_lot']
df['living_lot_ratio'].describe()
fig = px.scatter(df, x="living_lot_ratio", y="price", title="Living/Lot Ratio vs Price", trendline="ols")
fig.show()
Insights:
Now, let's see the correlation between a small living/lot ratio (small living space, large lot) and price.
df_1 = df[df['living_lot_ratio'] < 1]
fig = px.scatter(df_1, x="living_lot_ratio", y="price", title="Living/Lot Ratio < 1 vs Price", trendline="ols")
fig.show()
This graph shows that there is little to no correlation between having a large lot and the price.
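The visual impression can be quantified with the Pearson correlation coefficient; on the real data this would be `df_1['living_lot_ratio'].corr(df_1['price'])`. Below is a minimal, self-contained sketch on hypothetical data shaped like df_1, where ratio and price are drawn independently to mimic "little to no correlation":

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_1: ratio and price drawn independently,
# so the true linear relationship is zero
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "living_lot_ratio": rng.uniform(0.05, 1.0, 500),
    "price": rng.normal(500_000, 150_000, 500),
})

# Pearson r near 0 indicates no linear relationship
r = demo["living_lot_ratio"].corr(demo["price"])
print("Pearson r: {:.3f}".format(r))
```

An |r| close to 0 backs up what the flat trendline suggests visually.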
# check for duplicate rows based on id
duplicates = df[df.duplicated(subset=['id'], keep='first')]
print("Number of rows with duplicated id {}".format(len(duplicates)))
duplicates.sort_values(by=["id"]).head(10)
# get true/false value for resold houses and concatenate to df
resold = pd.DataFrame(df.duplicated(subset=['id'], keep='first').astype(int), columns = ["resold"])
df = pd.concat([df, resold], axis=1)
df.head()
display(df[df['yr_renovated'] > 1990]['yr_renovated'].value_counts())
len(df[df['yr_renovated'] > 1990]['yr_renovated'])
df['yr_renovated'].map(lambda x: 1 if x > 1990 else 0).value_counts()
df['renovated90'] = df['yr_renovated'].map(lambda x: 1 if x > 1990 else 0)
df
df = df.drop(columns=["yr_renovated"])
df['month'] = pd.DatetimeIndex(df["date"]).month
df.head()
sns.countplot(data=df, x='month')
plt.title("Month vs House Sold Frequency")
plt.show()
This is the age of the house at the point it was sold.
df['age'] = pd.DatetimeIndex(df["date"]).year - df['yr_built']
df = df.drop(columns=['yr_built'])
df
df = df.drop(columns=["id", "date", "lat", "long"])
# Correlation
corr = df.corr()
import plotly.figure_factory as ff
colorscale = 'rdbu'
font_colors = ['black']
heatmapdata = np.array(np.round(corr, 3))
x_axis = list(corr.columns)
y_axis = list(corr.index)
fig = ff.create_annotated_heatmap(heatmapdata, x=x_axis, y=y_axis, colorscale=colorscale, font_colors=font_colors)
for i in range(len(fig.layout.annotations)):
fig.layout.annotations[i].font.size = 10
fig.update_layout(
font=dict(
size=9,
),
height= 900,
title="Correlation Matrix"
)
fig.show()
Insights:
sqft_living and sqft_above are highly correlated because sqft_above is just sqft_living minus sqft_basement, and most houses do not have a basement. We will drop sqft_above and keep sqft_living.
sqft_living and sqft_living15 are also highly correlated. We will again keep sqft_living, as it describes the individual house.
sqft_lot and sqft_lot15 are highly correlated. We will keep sqft_lot because it describes the individual house.
df = df.drop(columns=["sqft_above", "sqft_living15", "sqft_lot15"])
def encode(df, discretecol):
    '''One-hot encode a discrete column and drop the original column.'''
    dfonehot = df[discretecol].astype('category')
    dfonehot = pd.get_dummies(dfonehot, prefix=discretecol, drop_first=True)
    df = pd.concat([df, dfonehot], axis=1)
    df = df.drop(discretecol, axis=1)
    return df
df = encode(df, "bedrooms")
df = encode(df, "bathrooms")
df = encode(df, "view")
df = encode(df, "condition")
df = encode(df, "zipcode")
df = encode(df, "month")
df = encode(df, "age")
df.info()
X = df.drop('price', axis = 1)
y = df['price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=16)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
df = pd.concat([X_train, y_train], axis = 1)
df.head()
df_test = pd.concat([X_test, y_test], axis = 1)
df_test.head()
X_train.info()
X_train
df_test.info()
Let's start by applying a simple linear regression.
We are going to use one continuous feature for this: sqft_living.
X_slr_train = np.array(df['sqft_living']).reshape(-1, 1)
y_train = df["price"]
import statsmodels.api as sm
X_int = sm.add_constant(X_slr_train)
model = sm.OLS(y_train, X_int).fit()
summary = model.summary()
summary
from sklearn import linear_model
linreg = linear_model.LinearRegression()
linreg.fit(X_slr_train, y_train)
from sklearn import model_selection
scores = model_selection.cross_val_score(
    linreg,
    X_slr_train,
    y_train,
    cv=10,
    scoring="neg_mean_squared_error")
scores
rmse_scores = np.sqrt(-scores)
display(rmse_scores)
display(rmse_scores.mean())
display(rmse_scores.std())
y_slr_pred = linreg.predict(X_slr_train)
plt.figure(figsize = (10,10))
# scatter plot
plt.scatter(x = X_slr_train, y = y_train, alpha = 0.3, label = 'Data Points')
# line plot
plt.plot(X_slr_train, y_slr_pred, color="red", label = "Regression Line")
plt.legend()
plt.title('Simple linear regression using sqft_living feature')
plt.xlabel('Living area (sqft)')
plt.ylabel('Price (USD)')
plt.show()
# Define table as a dataframe with specific columns. Each column's values will be a list that we will add to.
evaluationtable = pd.DataFrame({'Model': [], 'Description': [], 'Num Features': [], 'R-squared': [], 'RMSE': []})
# Add data for simple linear regression
evaluationtable.loc[0] = ['Simple Linear Regression', 'Sqft_living', model.df_model, round(model.rsquared,3), int(rmse_scores.mean()) ]
# View our evaluation table
evaluationtable
X_slr_test = np.array(df_test['sqft_living']).reshape(-1, 1)
y_test = df_test["price"]
y_slr_test_pred = linreg.predict(X_slr_test)
plt.figure(figsize = (10,10))
# scatter plot
plt.scatter(x = X_slr_test, y = y_test, alpha = 0.3, label = 'Data Points')
# line plot
plt.plot(X_slr_test, y_slr_test_pred, color="red", label = "Regression Line")
plt.legend()
plt.title('Simple linear regression using sqft_living feature')
plt.xlabel('Living area (sqft)')
plt.ylabel('Price (USD)')
plt.show()
X_dtr_train = X_train
y_dtr_train = y_train
from sklearn.tree import DecisionTreeRegressor
for i in range(100, 120):
    dtr = DecisionTreeRegressor(random_state=0, criterion="squared_error", splitter="best", max_depth=i)
    dtr.fit(X_dtr_train, y_dtr_train)
    y_dtr_pred = dtr.predict(X_dtr_train)
    # training MSE for this depth
    y_tot = 0
    for y, y_pred in zip(y_dtr_train, y_dtr_pred):
        y_tot += (y - y_pred) ** 2
    y_mse = y_tot / len(y_dtr_pred)
    print("max_depth {}: {}".format(i, y_mse))
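Note that training MSE alone is a risky guide here: a deeper tree always fits the training set at least as well, so training error keeps shrinking even as the model overfits. A cross-validated error curve is a more reliable way to pick max_depth. Below is a minimal sketch on synthetic stand-in data (X_demo, y_demo are hypothetical substitutes for our training set):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in for (X_dtr_train, y_dtr_train)
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=25.0, random_state=0)

best_depth, best_rmse = None, np.inf
for depth in [2, 4, 8, 16, 32]:
    dtr = DecisionTreeRegressor(random_state=0, max_depth=depth)
    # 5-fold cross-validated MSE, reported as RMSE
    scores = cross_val_score(dtr, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error")
    rmse = np.sqrt(-scores.mean())
    print("max_depth {:>2}: CV RMSE {:.1f}".format(depth, rmse))
    if rmse < best_rmse:
        best_depth, best_rmse = depth, rmse

print("best max_depth:", best_depth)
```

Unlike training error, the cross-validated RMSE typically bottoms out at a moderate depth and rises again once the tree starts overfitting.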
We are going to take 80 as the max_depth because larger values overfit.
dtr = DecisionTreeRegressor(random_state=0, criterion="squared_error", splitter="best", max_depth=80)
dtr
dtr.fit(X_dtr_train, y_dtr_train)
y_dtr_pred = dtr.predict(X_dtr_train)
plt.figure(figsize = (10,10))
# scatter plot
plt.scatter(x = X_dtr_train["sqft_living"], y = y_dtr_train, alpha = 0.3, label = 'Data Points')
# line plot
# plt.plot(X_dtr_train["sqft_living"], y_dtr_pred, color="red", label = "Regression Line")
plt.scatter(X_dtr_train["sqft_living"], y_dtr_pred, color="blue", alpha=0.3, label = "Regression Line")
plt.title('Decision Tree Regression y_pred Visualisation using sqft_living feature (Train)')
plt.xlabel('Living area (sqft)')
plt.ylabel('Price (USD)')
plt.show()
# test mse: compare test labels against predictions on the test set
y_test_dtr_pred = dtr.predict(X_test)
y_tot = 0
for y, y_pred in zip(y_test, y_test_dtr_pred):
    y_tot += (y - y_pred) ** 2
y_mse = y_tot / len(y_test_dtr_pred)
y_mse
plt.figure(figsize = (10,10))
# scatter plot
plt.scatter(x = X_test["sqft_living"], y = y_test, alpha = 0.3, label = 'Data Points')
# line plot
# plt.plot(X_dtr_train["sqft_living"], y_dtr_pred, color="red", label = "Regression Line")
plt.scatter(X_test["sqft_living"], y_test_dtr_pred, color="blue", alpha=0.3, label = "Regression Line")
plt.title('Decision Tree Regression y_pred Visualisation using sqft_living feature (Test)')
plt.xlabel('Living area (sqft)')
plt.ylabel('Price (USD)')
plt.show()
from sklearn.metrics import r2_score
dtr_r2 = r2_score(y_test, y_test_dtr_pred)
rmse_scores = np.sqrt(y_mse)
# Add data for decision tree regression
evaluationtable.loc[1] = ['Decision Tree Regression', 'All Features', len(X_train.columns), round(dtr_r2,3), int(rmse_scores.mean())]
# View our evaluation table
evaluationtable
The prediction task is to predict house prices based on house features.
Engineered features:
- living_lot_ratio -> sqft_living / sqft_lot
- resold -> whether the house has been resold within the dataset (boolean)
- renovated90 -> whether the house was renovated after 1990 (boolean)
- month -> the month in which the house was sold
- age -> sale date (year) - yr_built
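The feature engineering steps above can be collected into a single helper. This is a sketch assuming the kc_house_data column names used throughout this notebook, demonstrated on a tiny hypothetical sample:

```python
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the engineered features described above to a copy of df."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])
    out["living_lot_ratio"] = out["sqft_living"] / out["sqft_lot"]
    # later sales of the same id are flagged as resold
    out["resold"] = out.duplicated(subset=["id"], keep="first").astype(int)
    out["renovated90"] = (out["yr_renovated"] > 1990).astype(int)
    out["month"] = out["date"].dt.month
    out["age"] = out["date"].dt.year - out["yr_built"]
    return out

# Tiny hypothetical sample with the relevant columns
sample = pd.DataFrame({
    "id": [1, 1, 2],
    "date": ["20141013T000000", "20150312T000000", "20140610T000000"],
    "sqft_living": [1180, 1180, 2570],
    "sqft_lot": [5650, 5650, 7242],
    "yr_renovated": [0, 1991, 0],
    "yr_built": [1955, 1955, 1951],
})
feat = engineer_features(sample)
print(feat[["living_lot_ratio", "resold", "renovated90", "month", "age"]])
```

Keeping the steps in one function makes it easy to apply the identical transformation to new data at prediction time.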
I am using root mean squared error (RMSE) as the metric to evaluate my models.
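RMSE is the square root of the mean squared error, so it is in the same units as the target (USD here), which makes it easy to interpret. A minimal sketch with hypothetical prices and predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted house prices (USD)
y_true = np.array([300_000, 450_000, 600_000])
y_pred = np.array([320_000, 430_000, 650_000])

# RMSE = sqrt(mean((y_true - y_pred)^2))
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print("RMSE: {:.0f}".format(rmse))
```

Because errors are squared before averaging, RMSE penalises large misses more heavily than mean absolute error would.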
It has a higher $R^2$ value than the naive baseline: 0.640 - 0.464 = 0.176 increase.
The sqft_living feature is important because it showed a high correlation with price from the start of the analysis.